# **CUDA Programming**

#### **Outline**

- GPU
- CUDA Introduction
  - What is CUDA
  - CUDA Programming Model
  - Advantages & Limitations
- CUDA Programming
- Future Work

#### **GPU**

- GPUs are massively multithreaded many-core chips
  - Originally handled computation only for computer graphics
  - Hundreds of processors
  - Tens of thousands of concurrent threads
  - TFLOPs peak performance
  - Fine-grained data-parallel computation (data parallelism)
- Users across science & engineering disciplines are achieving tenfold and higher speedups on GPUs











#### **GPGPU**

- What is GPGPU?
  - General-purpose computing on GPUs
  - GPGPU is the use of a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
- Why GPGPU?
  - Massively parallel computing power
  - Inexpensive
- How?



#### What is CUDA?

- CUDA stands for Compute Unified Device Architecture.
  - A parallel computing architecture developed by NVIDIA.
  - Heterogeneous serial-parallel computing
  - The computing engine in NVIDIA GPUs.
  - CUDA is accessible to software developers through industry-standard programming languages.
- CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs.

# **Heterogeneous Computing**

**Host** The CPU and its memory (host memory)

**Device** The GPU and its memory (device memory)

# **Heterogeneous Computing**

```
#include <iostream>
#include <algorithm>
#include <cstdlib>

#define N 1024        // number of output elements (assumed; not defined on the slide)
#define RADIUS 3
#define BLOCK_SIZE 16

// ---- parallel function: runs on the device ----
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also load the halo on both sides
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

// ---- serial code: runs on the host ----
void fill_ints(int *x, int n) {
    std::fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;      // host copies of in, out
    int *d_in, *d_out;  // device copies of in, out
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in = (int *)malloc(size); fill_ints(in, N + 2 * RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // ---- parallel code: launch stencil_1d() kernel on the GPU ----
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // ---- serial code ----
    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

#### **CUDA Kernels and Threads**

- Parallel portions of an application are executed on the device as kernels
  - A kernel is code that runs in parallel
- A kernel is a function that runs on the device
  - One kernel is executed at a time (on early CUDA hardware; newer GPUs can run kernels from different streams concurrently)
  - Many threads execute each kernel
- Differences between CUDA and CPU threads
  - CUDA threads are extremely lightweight
    - Very little creation overhead
    - Instant switching
- A kernel is executed by a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - Sharing data through shared memory
  - Synchronizing their execution
- Threads from the same block can cooperate; threads from different blocks cannot
- All threads within a block can
  - Share data through shared memory
  - Synchronize using `__syncthreads()`, a barrier: a thread that reaches it waits until the other threads in the block arrive
- Threads and blocks have unique IDs
  - Available through special variables (such as `threadIdx` and `blockIdx`)
SIMT (Single Instruction Multiple Threads) Execution

Threads run in groups of 32 called warps



© NVIDIA Corporation

#### Single Instruction Multiple Thread (SIMT) Execution:

- Groups of 32 threads are formed into warps
  - always executing the same instruction
  - share instruction fetch/dispatch
  - hardware automatically handles divergence
- Warps are the primitive unit of scheduling
  - all warps from all active blocks are time-sliced
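
To make the grouping into warps concrete, here is a small sketch in which each thread of a one-dimensional block records which warp it belongs to and its position (lane) inside that warp. It uses the built-in device variable `warpSize`; the names `record_warp_ids`, `WarpLayout`, and `warp_layout` are assumptions for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Result of the sketch: warp index and lane index for every thread in one block.
struct WarpLayout {
    std::vector<int> warp;
    std::vector<int> lane;
};

// For a 1D block, consecutive threadIdx.x values are packed into warps.
__global__ void record_warp_ids(int *warp, int *lane) {
    int t = threadIdx.x;
    warp[t] = t / warpSize;  // which warp of the block this thread is in
    lane[t] = t % warpSize;  // position of the thread inside its warp
}

// Hypothetical host helper: launches a single block of `threads` threads.
WarpLayout warp_layout(int threads) {
    size_t bytes = threads * sizeof(int);
    int *d_warp = nullptr, *d_lane = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_warp, bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc((void **)&d_lane, bytes);
    assert(err == cudaSuccess);

    record_warp_ids<<<1, threads>>>(d_warp, d_lane);

    WarpLayout out{std::vector<int>(threads), std::vector<int>(threads)};
    cudaMemcpy(out.warp.data(), d_warp, bytes, cudaMemcpyDeviceToHost);
    err = cudaMemcpy(out.lane.data(), d_lane, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_warp);
    cudaFree(d_lane);
    return out;
}
```

With a warp size of 32, a block of 64 threads splits into two warps: threads 0 to 31 and threads 32 to 63.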





# **Control Flow Divergence**



When threads in the same warp take different branches, the threads that are not on the active path wait until it finishes.
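
As an illustrative sketch of divergence, the two kernels below use the same pair of branches but choose between them differently. In the first, even and odd threads of one warp disagree, so the warp has to cover both paths. In the second, the condition is uniform across each warp. The kernel and helper names are assumptions, and for branches this small the compiler may use predication instead of real branching, so treat it as a picture of the idea rather than a benchmark.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Divergent: neighbouring threads of the same warp take different branches.
__global__ void branch_per_thread(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) out[i] = 2 * i;
    else                      out[i] = i + 1;
}

// Uniform per warp: all threads of a warp evaluate the condition the same way.
__global__ void branch_per_warp(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 2 * i;
    else                                   out[i] = i + 1;
}

// Hypothetical host helper: runs one of the two kernels with 64 threads per block.
std::vector<int> run_branch_kernel(bool per_warp, int blocks) {
    const int threads = 64;
    int n = blocks * threads;
    size_t bytes = n * sizeof(int);
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_out, bytes);
    assert(err == cudaSuccess);

    if (per_warp) branch_per_warp<<<blocks, threads>>>(d_out);
    else          branch_per_thread<<<blocks, threads>>>(d_out);

    std::vector<int> out(n);
    err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_out);
    return out;
}
```

Both kernels produce correct results either way; divergence affects how the hardware executes the warp, not what each thread computes.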



© NVIDIA 2013

# Simple Processing Flow

[Figure: simple processing flow]

# Memory Model

- Types of device memory
  - Registers: read/write, per thread
  - Local Memory: read/write, per thread
  - Shared Memory: read/write, per block
  - Global Memory: read/write, across grids
  - Constant Memory: read-only, across grids
  - Texture Memory: read-only, across grids
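
Several of these memory types can appear together in one kernel. The sketch below is an assumption-laden illustration, not code from the slides: coefficients sit in constant memory, each thread keeps a scalar that is typically held in a register, threads of a block combine values in shared memory, and inputs and outputs live in global memory. Local and texture memory are not shown. The names `coeff`, `block_sums`, `block_sums_on_device`, and `THREADS` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 128;  // threads per block (assumed; must be a power of two)

__constant__ float coeff[2];  // constant memory: read-only in kernels, written by the host

// Each block computes the sum of coeff[0] * in[i] + coeff[1] over its elements.
__global__ void block_sums(const float *in, float *out) {
    __shared__ float partial[THREADS];      // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = coeff[0] * in[i] + coeff[1];  // per-thread scalar, typically a register
    partial[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];  // write to global memory
}

// Hypothetical host helper; the input length must be a multiple of THREADS.
std::vector<float> block_sums_on_device(const std::vector<float> &in, float a, float b) {
    int n = static_cast<int>(in.size());
    assert(n > 0 && n % THREADS == 0);
    int blocks = n / THREADS;

    float h_coeff[2] = {a, b};
    cudaError_t err = cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    assert(err == cudaSuccess);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sums<<<blocks, THREADS>>>(d_in, d_out);

    std::vector<float> out(blocks);
    err = cudaMemcpy(out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The barrier inside the loop sits outside the `if`, so every thread of the block reaches it on each iteration.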




#### There are 6 Memory Types:

- Registers
  - on chip
  - fast access
  - per thread
  - limited amount
  - 32 bits each
- Local Memory
  - in DRAM
  - slow
  - non-cached
  - per thread
  - relatively large
  - used for local variables
- Shared Memory
  - on chip
  - fast access
  - per block
  - 16 KByte (on early CUDA hardware; larger on newer GPUs)
  - synchronize between threads
- Global Memory
- Constant Memory
- Texture Memory
  - in DRAM
  - cached
  - per grid
  - read-only


- Registers and Shared Memory
  - on chip
- Local, Global, Constant, and Texture Memory
  - in device memory
- Global, Constant, and Texture Memory
  - managed by host code
  - persistent across kernels


# Advantages of CUDA

- CUDA has several advantages over traditional general-purpose computation on GPUs:
  - Scattered reads: code can read from arbitrary addresses in memory.
  - Shared memory: CUDA exposes a fast shared memory region (16 KB in size on early hardware) that can be shared amongst the threads of a block.
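
Scattered reads can be illustrated with a gather, where each thread reads from an address chosen by an index array rather than from its own position. This is a minimal sketch; the names `gather` and `gather_on_device` are assumptions, and every index is assumed to be in range.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Each thread reads src at an arbitrary position given by index[i].
__global__ void gather(const float *src, const int *index, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[index[i]];
}

// Hypothetical host helper; every entry of `index` must be a valid position in `src`.
std::vector<float> gather_on_device(const std::vector<float> &src,
                                    const std::vector<int> &index) {
    int n = static_cast<int>(index.size());
    float *d_src = nullptr, *d_dst = nullptr;
    int *d_index = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_src, src.size() * sizeof(float));
    assert(err == cudaSuccess);
    cudaMalloc((void **)&d_index, n * sizeof(int));
    cudaMalloc((void **)&d_dst, n * sizeof(float));
    cudaMemcpy(d_src, src.data(), src.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_index, index.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    gather<<<blocks, threads>>>(d_src, d_index, d_dst, n);

    std::vector<float> dst(n);
    err = cudaMemcpy(dst.data(), d_dst, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_src);
    cudaFree(d_index);
    cudaFree(d_dst);
    return dst;
}
```

For example, gathering from `{10, 20, 30, 40}` with indices `{3, 0, 2, 2, 1}` should yield `{40, 10, 30, 30, 20}`.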

#### **Limitations of CUDA**

- CUDA has several limitations compared with traditional general-purpose computation on GPUs:
  - A single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.
  - The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
  - CUDA-enabled GPUs are only available from NVIDIA.